Abstract. Recently, the SPARQL query language for RDF 
has reached the W3C recommendation status. In response to 
this emerging standard, the database community is currently 
exploring efficient storage techniques for RDF data and evalua- 
tion strategies for SPARQL queries. A meaningful analysis and 
comparison of these approaches necessitates a comprehensive and 
universal benchmark platform. To this end, we have developed 
SP²Bench, a publicly available, language-specific SPARQL per- 
formance benchmark. SP²Bench is set in the DBLP scenario 
and comprises both a data generator for creating arbitrarily large 
DBLP-like documents and a set of carefully designed benchmark 
queries. The generated documents mirror key characteristics and 
social-world distributions encountered in the original DBLP data 
set, while the queries implement meaningful requests on top of 
this data, covering a variety of SPARQL operator constellations 
and RDF access patterns. As a proof of concept, we apply 
SP²Bench to existing engines and discuss their strengths and 
weaknesses that follow immediately from the benchmark results. 

I. Introduction 

The Resource Description Framework [1] (RDF) has be- 
come the standard format for encoding machine-readable 
information in the Semantic Web [2]. RDF databases can 
be represented by labeled directed graphs, where each edge 
connects a so-called subject node to an object node under label 
predicate. The intended semantics is that the object denotes 
the value of the subject's property predicate. Supplementary to 
RDF, the W3C has recommended the declarative SPARQL [3] 
query language, which can be used to extract information 
from RDF graphs. SPARQL builds on a powerful graph 
matching facility that binds variables to components in 
the input RDF graph. In addition, operators akin to relational 
joins, unions, left outer joins, selections, and projections can 
be combined to build more expressive queries. 

By now, several proposals for the efficient evaluation of 
SPARQL have been made. These approaches comprise a wide 
range of optimization techniques, including normal forms [4], 
graph pattern reordering based on selectivity estimations [5] 
(similar to relational join reordering), syntactic rewriting [6], 
specialized indices [7], [8] and storage schemes [9], [10], [11], 
[12], [13] for RDF, and Semantic Query Optimization [14]. 
Another viable option is the translation of SPARQL into 
SQL [15], [16] or Datalog [17], which facilitates the evaluation 
with traditional engines, thus falling back on established 
optimization techniques implemented in conventional engines. 

As a proof of concept, most of these approaches have 
been evaluated experimentally either in user-defined scenarios, 
on top of the LUBM benchmark [18], or using the Barton 
Library benchmark [19]. We claim that none of these sce- 
narios is adequate for testing SPARQL implementations in a 
general and comprehensive way: On the one hand, user-defined 
scenarios are typically designed to demonstrate very specific 
properties and, for this reason, lack generality. On the other 
hand, the Barton Library Benchmark is application-oriented, 
while LUBM was primarily designed to test the reasoning 
and inference mechanisms of Knowledge Base Systems. As a 
consequence, neither benchmark covers central SPARQL operators 
such as OPTIONAL and UNION, or solution modifiers. 

With the SPARQL Performance Benchmark (SP²Bench) 
we propose a language-specific benchmark framework specif- 
ically designed to test the most common SPARQL constructs, 
operator constellations, and a broad range of RDF data access 
patterns. The SP²Bench data generator and benchmark queries 
are available for download in a ready-to-use format. 1 

In contrast to application-specific benchmarks, SP²Bench 
aims at a comprehensive performance evaluation, rather than 
assessing the behavior of engines in an application-driven 
scenario. Consequently, it is not motivated by a single use case, 
but instead covers a broad range of challenges that SPARQL 
engines might face in different contexts. It thus allows us 
to assess the generality of optimization approaches and to 
compare them in a universal, application-independent setting. 
We argue that, for these reasons, our benchmark provides 
excellent support for testing the performance of engines 
comprehensively, which might help to improve the quality of 
future research in this area. We emphasize that such language- 
specific benchmarks (e.g., XMark [20]) have found broad 
acceptance, in particular in the research community. 

It is quite evident that the domain of a language-specific 
benchmark should not only constitute a representative scenario 
that captures the philosophy behind the data format, but also 
leave room for challenging queries. With the choice of the 
DBLP [21] library we satisfy both desiderata. First, RDF has 
been particularly designed to encode metadata, which makes 
DBLP an excellent candidate. Furthermore, DBLP reflects 
interesting social-world distributions (cf. [22]), and hence 
captures the social network character of the Semantic Web, 
whose idea is to integrate a great many small databases 
into a global semantic network. It thereby facilitates the 
design of interesting queries on top of these distributions. 

Our data generator supports the creation of arbitrarily large 
DBLP-like models in RDF format, which mirror key 
characteristics and distributions of DBLP. Consequently, our 
framework combines the benefits of a data generator for 
creating arbitrarily large documents with interesting data that 
contains many real-world characteristics, i.e. mimics natural 
correlations between entities, such as power law distributions 
(found in the citation system or the distribution of papers 
among authors) and limited growth curves (e.g., the increasing 
number of venues and publications over time). To this end, 
our generator relies on an in-depth study of DBLP, which 
comprises the analysis of entities (e.g. articles and authors), 
their properties, frequency, and also their interaction. 

Complementary to the data generator, we have de- 
signed 17 meaningful queries that operate on top of the 
generated documents. They cover not only the most important 
SPARQL constructs and operator constellations, but also vary 
in their characteristics, such as complexity and result size. The 
detailed knowledge of data characteristics plays a crucial role 
in query design and makes it possible to predict the challenges 
that the queries impose on SPARQL engines. This, in turn, 
facilitates the interpretation of benchmark results. 

The key contributions of this paper are the following. 

• We present SP²Bench, a comprehensive benchmark for 
the SPARQL query language, comprising a data generator 
and a collection of 17 benchmark queries. 

• Our generator supports the creation of arbitrarily large 
DBLP documents in RDF format, reflecting key charac- 
teristics and social-world relations found in the original 
DBLP database. The generated documents cover various 
RDF constructs, such as blank nodes and containers. 

• The benchmark queries have been carefully designed 
to test a variety of operator constellations, data access 
patterns, and optimization strategies. In the exhaustive 
discussion of these queries we also highlight the specific 
challenges they impose on SPARQL engines. 

• As a proof of concept, we apply SP²Bench to selected 
SPARQL engines and discuss their strengths and weak- 
nesses that follow from the benchmark results. This 
analysis confirms that our benchmark is well-suited to 
identify deficiencies in SPARQL implementations. 

• We finally propose performance metrics that capture 
different aspects of the evaluation process. 

Outline. We next discuss related work and design decisions 
in Section II. The analysis of DBLP in Section III forms the 
basis for our data generator in Section IV. Section V gives an 
introduction to SPARQL and describes the benchmark queries. 
The experiments in Section VI comprise a short evaluation 
of our generator and benchmark results for existing SPARQL 
engines. We conclude with some final remarks in Section VII. 



II. Benchmark Design Decisions 

Benchmarking. The Benchmark Handbook [23] provides 
a summary of important database benchmarks. Probably the 
most "complete" benchmark suite for relational systems is 
TPC 2 , which defines performance and correctness benchmarks 
for a large variety of scenarios. There also exists a broad range 
of benchmarks for other data models, such as object-oriented 
databases (e.g., OO7 [24]) and XML (e.g., XMark [20]). 

With the growing importance of RDF, different bench- 
marks for it have been developed. The Lehigh University 
Benchmark [18] (LUBM) was designed with a focus on infer- 
ence and reasoning capabilities of RDF engines. However, the 
SPARQL specification [3] disregards the semantics of RDF 
and RDFS [25], [26], i.e. does not involve automated reasoning 
on top of RDFS constructs such as subclass and subproperty 
relations. In this regard, LUBM does not constitute an 
adequate scenario for SPARQL performance evaluation. This 
is underlined by the fact that central SPARQL operators, such 
as UNION and OPTIONAL, are not addressed in LUBM. 

The Barton Library benchmark [19] queries implement a 
user browsing session through the RDF Barton online catalog. 
By design, the benchmark is application-oriented. All queries 
are encoded in SQL, assuming that the RDF data is stored in 
a relational DB. Due to missing language support for aggrega- 
tion, most queries cannot be translated into SPARQL. On the 
other hand, central SPARQL features like left outer joins (the 
relational equivalent of SPARQL operator OPTIONAL) and 
solution modifiers are missing. In summary, the benchmark 
offers only limited support for testing native SPARQL engines. 

The application-oriented Berlin SPARQL Benchmark [27] 
(BSBM) tests the performance of SPARQL engines in a pro- 
totypical e-commerce scenario. BSBM is use-case driven and 
does not particularly address language-specific issues. With its 
focus, it is supplementary to the SP²Bench framework. 

The RDF(S) data model benchmark in [28] focuses on 
structural properties of RDF Schemas. In [29] graph features 
of RDF Schemas are studied, showing that they typically 
exhibit power law distributions which constitute a valuable 
basis for synthetic schema generation. With their focus on 
schemas, both [28] and [29] are complementary to our work. 

A synthetic data generation approach for OWL based on 
test data is described in [30]. There, the focus is on rapidly 
generating large data sets from representative data of a fixed 
domain. Our data generation approach is more fine-grained, as 
we analyze the development of entities (e.g. articles) over time 
and reflect many characteristics found in social communities. 

Design Principles. In the Benchmark Handbook [23], four 
key requirements for domain-specific benchmarks are pos- 
tulated: a benchmark should be (1) relevant, thus testing typical 
operations within the specific domain, (2) portable, i.e. 
executable on different platforms, (3) scalable, e.g. it should 
be possible to run the benchmark on both small and very large 
data sets, and last but not least (4) understandable. 

2 See http://www.tpc.org. 



For a language-specific benchmark, the relevance require- 
ment (1) suggests that queries implement realistic requests 
on top of the data. Thereby, the benchmark should not 
focus on correctness verification, but on common operator 
constellations that impose particular challenges. For instance, 
two SP²Bench queries test negation, which (under the closed- 
world assumption) can be expressed in SPARQL through a 
combination of operators OPTIONAL, FILTER, and BOUND. 
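
The negation pattern mentioned above can be illustrated outside of SPARQL. The following Python sketch is only a toy model, not a benchmark query: the triples and the property name ex:pages are hypothetical. It mimics OPTIONAL as a left outer join and FILTER(!BOUND(...)) as a test for an unbound variable.

```python
from typing import Optional

Triple = tuple[str, str, str]

# Hypothetical toy data: two articles, only one of which has a pages value.
triples: list[Triple] = [
    ("article1", "rdf:type", "bench:Article"),
    ("article2", "rdf:type", "bench:Article"),
    ("article1", "ex:pages", "10"),
]


def left_outer_join(subjects: list[str], predicate: str) -> list[tuple[str, Optional[str]]]:
    """OPTIONAL: keep every subject; bind the object only if a triple matches."""
    rows: list[tuple[str, Optional[str]]] = []
    for s in subjects:
        matches = [o for (s2, p, o) in triples if s2 == s and p == predicate]
        if matches:
            rows.extend((s, o) for o in matches)
        else:
            rows.append((s, None))  # the optional variable stays unbound
    return rows


articles = [s for (s, p, o) in triples if p == "rdf:type" and o == "bench:Article"]
joined = left_outer_join(articles, "ex:pages")

# FILTER(!BOUND(?pages)): keep rows whose optional variable is unbound.
without_pages = [s for (s, pages) in joined if pages is None]
print(without_pages)  # ['article2']
```

The filter only yields "articles without pages" because missing information is read as false, which is exactly the closed-world reading mentioned in the text.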

Requirements (2) portability and (3) scalability bring along 
technical challenges concerning the implementation of the data 
generator. In response, our data generator is deterministic, 
platform independent, and accurate w.r.t. the desired size of 
generated documents. Moreover, it is very efficient and gets by 
with a constant amount of main memory, and hence supports 
the generation of arbitrarily large RDF documents. 
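
To make the two properties concrete, the following Python sketch shows one way a generator can be both deterministic and constant in memory. It is an illustration under our own assumptions, not the SP²Bench implementation: the function name, the example.org URIs, and the triple layout are hypothetical.

```python
import random
from typing import Iterator


def generate_triples(limit: int, seed: int = 0) -> Iterator[str]:
    """Yield `limit` N-Triples-like lines one at a time.

    Nothing is accumulated in memory, so the footprint does not grow
    with the size of the output.
    """
    rng = random.Random(seed)  # fixed seed: the same sequence on every run
    for i in range(limit):
        year = rng.randint(1960, 2005)
        yield f'<http://example.org/doc{i}> <http://example.org/year> "{year}" .'


# Two independent runs with the same seed produce identical documents.
first_run = list(generate_triples(5))
second_run = list(generate_triples(5))
print(first_run == second_run)  # True
```

In a real setting the lines would be written to a file as they are produced instead of being collected in a list.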

From the viewpoint of engine developers, a benchmark 
should give hints on deficiencies in design and implementa- 
tion. This is where (4) understandability comes into play, i.e. it 
is important to keep queries simple and understandable. At the 
same time, they should leave room for diverse optimizations. 
In this regard, the queries are designed in such a way that they 
are amenable to a wide range of optimization strategies. 

DBLP. We set SP²Bench in the DBLP [21] scenario. 
The DBLP database contains bibliographic information about 
the field of Computer Science and, particularly, databases. 

In the context of semi-structured data one often dis- 
tinguishes between data- and document-centric scenarios. 
Document-centric design typically involves large amounts of 
free-form text, while data-centric documents are more struc- 
tured and usually processed by machines rather than humans. 
RDF has been specifically designed for encoding information 
in a machine-readable way, so it basically follows the data- 
centric approach. DBLP, which contains structured data and 
little free text, constitutes such a data-centric scenario. 

As discussed in the Introduction, our generator mirrors vital 
real-world distributions found in the original DBLP data. This 
constitutes an improvement over existing generators that create 
purely synthetic data, in particular in the context of a language- 
specific benchmark. Ultimately, our generator might also be 
useful in other contexts, whenever large RDF test data is 
required. We point out that the DBLP-to-RDF translation of 
the original DBLP data in [31] provides only a fixed amount 
of data and, for this reason, is not sufficient for our purpose. 

We finally mention that sampling down large, existing data 
sets such as U.S. Census 3 (about 1 billion triples) might 
be another reasonable option to obtain data with real-world 
characteristics. The disadvantage, however, is that sampling 
might destroy more complex distributions in the data, thus 
leading to unnatural and "corrupted" RDF graphs. In contrast, 
our decision to build a data generator from scratch allows us to 
customize the structure of the RDF data, which is in line with 
the idea of a comprehensive, language-specific benchmark. 
This way, we easily obtain documents that contain a rich set 
of RDF constructs, such as blank nodes or containers. 

3 http://www.rdfabout.com/demo/census/ 



<!ELEMENT dblp 
  (article | inproceedings | proceedings | book | 
   incollection | phdthesis | mastersthesis | www)*> 
<!ENTITY % field 
  "author | editor | title | booktitle | pages | year | address | 
   journal | volume | number | month | url | ee | cdrom | cite | 
   publisher | note | crossref | isbn | series | school | chapter"> 
<!ELEMENT article (%field;)*> ... <!ELEMENT www (%field;)*> 

Fig. 1. Extract of the DBLP DTD 

III. The DBLP Data Set 

The study of the DBLP data set in this section lays the 
foundations for our data generator. The analysis of frequency 
distributions in scientific production has first been discussed 
in [32], and characteristics of DBLP have been investigated 
in [22]. The latter work studies a subset of DBLP, restricting 
DBLP to publications in database venues. It is shown that 
(this subset of) DBLP reflects vital social relations, forming 
a "small world" on its own. Although this analysis forms 
valuable groundwork, our approach is of more pragmatic 
nature, as we approximate distributions by concrete functions. 

We use function families that naturally reflect the scenarios, 
e.g. logistic curves for modeling limited growth or power 
equations for power law distributions. All approximations have 
been done with the ZunZun 4 data modeling tool and the 
gnuplot 5 curve fitting module. Data extraction from the DBLP 
XML data was realized with the MonetDB/XQuery 6 processor. 

An important objective of this section is also to provide 
insights into key characteristics of DBLP data. Although it is 
impossible to mirror all relations found in the original data, 
we work out a variety of interesting relationships, considering 
entities, their structure, or the citation system. The insights 
that we gain establish a deep understanding of the benchmark 
queries and their specific challenges. As an example, Q3a, 
Q3b, and Q3c (see Appendix) look similar, but pose different 
challenges based on the probability distribution of article 
properties discussed within this section; Q7, on the other hand, 
heavily depends on the DBLP citation system. 

Although the generated data is very similar to the original 
DBLP data for years up to the present, we can give no 
guarantees that our generated data goes hand in hand with the 
original DBLP data for future years. However, and this is much 
more important, even in the future the generated data will 
follow reasonable (and well-known) social-world distributions. 
We emphasize that the benchmark queries are designed to 
primarily operate on top of these relations and distributions, 
which makes them realistic, predictable and understandable. 
For instance, some queries operate on top of the citation 
system, which is mirrored by our generator. In contrast, the 
distribution of article release months is ignored, hence no 
query relies on this property. 

A. Structure of Document Classes 

Our starting point for the discussion is the DBLP DTD 
and the February 25, 2008 version of DBLP. An extract of 

4 http://www.zunzun.com 
5 http://www.gnuplot.info 
6 http://monetdb.cwi.nl/XQuery/ 



the DTD is provided in Figure 1. The dblp element defines 
eight child entities, namely ARTICLE, INPROCEEDINGS, . . ., 
and WWW resources. We call these entities document classes, 
and instances thereof documents. Furthermore, we distinguish 
between PROCEEDINGS documents, called conferences, and 
instances of the remaining classes, called publications. 

The DTD defines 22 possible child tags, such as author 
or url, for each document class. They describe documents, 
and we call them attributes in the following. According to 
the DTD, each document might be described by an arbitrary 
combination of attributes. Even repeated occurrences of the 
same attribute are allowed, e.g. a document might have several 
authors. However, in practice only a subset of all document 
class/attribute combinations occurs. For instance (as one 
might expect), attribute pages is never associated with WWW 
documents, but typically associated with ARTICLE entities. In 
Table I we show, for selected document class/attribute pairs, 
the probability that the attribute describes a document of this 
class 7 . To give an example, about 92.61% of all ARTICLE 
documents are described by the attribute pages. 

This probability distribution forms the basis for generating 
document class instances. Note that we simplify and assume 
that the presence of an attribute does not depend on the 
presence of other attributes, i.e. we ignore conditional proba- 
bilities. We will elaborate on this decision in Section VII. 
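
The independence assumption translates directly into one coin flip per attribute. The following Python sketch illustrates this for ARTICLE documents, using the ARTICLE column of Table I; the function and variable names are ours and purely illustrative.

```python
import random

# Presence probabilities for ARTICLE documents, taken from Table I.
ARTICLE_ATTRIBUTE_PROB = {
    "author": 0.9895, "cite": 0.0048, "editor": 0.0000, "isbn": 0.0000,
    "journal": 0.9994, "month": 0.0065, "pages": 0.9261, "title": 1.0000,
}


def sample_attributes(probabilities: dict[str, float], rng: random.Random) -> set[str]:
    """Decide independently for each attribute whether it is present."""
    return {name for name, p in probabilities.items() if rng.random() < p}


rng = random.Random(42)  # seeded so that the experiment is repeatable
documents = [sample_attributes(ARTICLE_ATTRIBUTE_PROB, rng) for _ in range(10_000)]

# The empirical share should be close to the table value of 0.9261.
share_with_pages = sum("pages" in d for d in documents) / len(documents)
print(round(share_with_pages, 2))
```

Because every attribute is drawn on its own, correlations such as "documents with a journal usually also have a volume" are not reproduced, which is the simplification stated above.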

Repeated Attributes. A study of DBLP reveals that, in 
practice, only few attributes occur repeatedly within single 
documents. For the majority of them, the number of repeated 
occurrences is negligible, so we restrict ourselves to the 
most frequently repeated attributes cite, editor, and author. 

Figure 2(a) exemplifies our analysis for attribute cite. It 
shows, for documents with at least one cite occurrence, 
the probability (y-axis) that the document has exactly n cite 
attributes (x-axis). According to Table I, only a small fraction 
of documents are described by cite (e.g. 0.48% of all ARTICLE 
documents). This value should be close to 100% in the real world, 
meaning that DBLP contains only a fraction of all citations. 
This is also why, in Figure 2(a), we consider only documents 
with at least one outgoing citation; when assigning citations 
later on, however, we first use the probability distribution of 
attributes in Table I to estimate the number of documents with 
at least one outgoing citation and afterwards apply the citation 
distribution in Figure 2(a). This way, we exactly mirror the 
distribution found in the original DBLP data. 

Based on experiments with different function families, we 
decided to use bell-shaped Gaussian curves for data approx- 
imation. Such functions are typically used to model normal 
distributions. Strictly speaking, our data is not normally dis- 
tributed (i.e. there is the left limit x = 1); however, these 
curves fit the data nicely for x > 1 (cf. Figure 2(a)). Gaussian 
curves are described by functions 

$$p_{gauss}^{(\mu,\sigma)}(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-0.5\left(\frac{x-\mu}{\sigma}\right)^{2}},$$

where $\mu \in \mathbb{R}$ fixes the x-position of the peak and $\sigma \in \mathbb{R}_{>0}$ 

7 The full correlation matrix can be found in Table IX in the Appendix. 



TABLE I 

Probability distribution for selected attributes 

            Article   Inproc.   Proc.     Book      WWW 
author      0.9895    0.9970    0.0001    0.8937    0.9973 
cite        0.0048    0.0104    0.0001    0.0079    0.0000 
editor      0.0000    0.0000    0.7992    0.1040    0.0004 
isbn        0.0000    0.0000    0.8592    0.9294    0.0000 
journal     0.9994    0.0000    0.0004    0.0000    0.0000 
month       0.0065    0.0000    0.0001    0.0008    0.0000 
pages       0.9261    0.9489    0.0000    0.0000    0.0000 
title       1.0000    1.0000    1.0000    1.0000    1.0000 

specifies the statistical spread. For instance, the approximation 
function for the cite distribution in Figure 2(a) is defined by 
$d_{cite}(x) := p_{gauss}^{(\ldots,\,10.07)}(x)$. The analysis and the resulting 
distribution of repeated editor attributes are structurally similar 
and are described by the function $d_{editor}(x) := p_{gauss}^{(\ldots)}(x)$. 

The approximation function for repeated author attributes 
is based on a Gaussian curve, too. However, we observed that 
the average number of authors per publication has increased 
over the years. The same observation was made in [22] 
and explained by the increasing pressure to publish and the 
proliferation of new communication platforms. Due to the 
prominent role of authors, we decided to mimic this property. 
As a consequence, parameters $\mu$ and $\sigma$ are not fixed (as was 
the case for the distributions $d_{cite}$ and $d_{editor}$), but modeled as 
functions over time. More precisely, $\mu$ and $\sigma$ are realized by 
limited growth functions 8 (so-called logistic curves) that yield 
higher values for later years. The distribution is described by 

$$d_{auth}(x, yr) := p_{gauss}^{(\mu_{auth}(yr),\,\sigma_{auth}(yr))}(x), \text{ where}$$
$$\mu_{auth}(yr) := \frac{\ldots}{1 + 17.59\,e^{-0.11(yr-1975)}} + 1.05, \text{ and}$$
$$\sigma_{auth}(yr) := \frac{\ldots}{1 + 6.46\,e^{-0.10(yr-1975)}} + 0.50.$$

We will discuss the logistic curve function type in more 
detail in the following subsection. 
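
As an illustration of how a Gaussian curve can drive the number of repeated attribute occurrences, consider the following Python sketch. It assumes the usual normalized Gaussian density and renormalizes it over the admissible counts n = 1, ..., max_n, since the data has the left limit x = 1. The parameter values are hypothetical and chosen for readability only; they are not the fitted DBLP values.

```python
import math


def p_gauss(x: float, mu: float, sigma: float) -> float:
    """Gaussian density with peak position mu and statistical spread sigma > 0."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def repeat_count_distribution(mu: float, sigma: float, max_n: int) -> list[float]:
    """Probabilities for n = 1..max_n occurrences, normalized to sum to 1."""
    weights = [p_gauss(n, mu, sigma) for n in range(1, max_n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical parameters for illustration only (not the fitted DBLP values).
probs = repeat_count_distribution(mu=3.0, sigma=1.5, max_n=10)
most_likely_n = probs.index(max(probs)) + 1
print(most_likely_n)  # 3
```

Making mu and sigma functions of the year, as done for author attributes, only changes the arguments passed to repeat_count_distribution.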

B. Key Characteristics of DBLP 

We next investigate the quantity of document class instances 
over time. We noticed that DBLP contains only sparse and 
incomplete information in its early years, and also found 
anomalies in the final years, mostly in the form of lowered 
growth rates. It might be that, in the coming years, some 
more conferences for these years will be added belatedly 
(i.e. data might not yet be totally complete), so we restrict 
our discussion to DBLP data ranging from 1960 to 2005. 

Figure 2(b) plots the number of PROCEEDINGS, JOURNAL, 
INPROCEEDINGS, and ARTICLE documents as a function of 
time. The y-axis is in log scale. Note that JOURNAL is not an 
explicit document class, but implicitly defined by the journal 
attribute of ARTICLE documents. We observe that inproceed- 
ings and articles are closely coupled to the proceedings and 
journals. For instance, there are always about 50-60 times more 

8 We make the reasonable assumption that the number of coauthors will 
eventually stabilize. 
Fig. 2. (a) Distribution of citations, (b) Document class instances, and (c) Publication counts 



inproceedings than proceedings, which indicates the average 
number of inproceedings per proceeding. 

Figure 2(b) shows exponential growth for all document 
classes, where the growth rate of JOURNAL and ARTICLE 
documents decreases in the final years. This suggests a limited 
growth scenario. Limited growth is typically modeled by 
logistic curves, which describe functions with a lower and an 
upper asymptote that either continuously increase or decrease 
for increasing x. We use curves of the form 

$$f_{logistic}(x) = \frac{a}{1 + b\,e^{-cx}},$$

where $a, b, c \in \mathbb{R}_{>0}$. For this parameter setting, a constitutes 
the upper asymptote and the x-axis forms the lower asymptote. 
The curve is "caught" in-between its asymptotes and increases 
continuously, i.e. it is S-shaped. The approximation function 
for the number of JOURNAL documents, which is also plotted 
in Figure 2(b), is defined by the formula 



$$f_{journal}(yr) := \frac{740.43}{1 + 426.28\,e^{-0.12(yr-1950)}}.$$

Approximation functions for ARTICLE, PROCEEDINGS, IN- 
PROCEEDINGS, BOOK, and INCOLLECTION documents differ 
only in the parameters. PhD Theses, Masters Theses, 
and WWW documents were distributed unsteadily, so we 
modeled them by random functions. It is worth mentioning 
that the number of articles and inproceedings per year clearly 
dominates the number of instances of the remaining classes. 
The concrete formulas look as follows. 



$$f_{article}(yr) := \frac{58519.12}{1 + 876.80\,e^{-0.12(yr-1950)}}$$
$$f_{proc}(yr) := \frac{5502.31}{1 + 1250.26\,e^{-0.14(yr-1965)}}$$
$$f_{inproc}(yr) := \frac{337132.34}{1 + 1901.05\,e^{-0.15(yr-1965)}}$$
$$f_{incoll}(yr) := \frac{3577.31}{1 + 196.49\,e^{-0.09(yr-1980)}}$$
$$f_{book}(yr) := \frac{52.97}{1 + 40739.38\,e^{-0.32(yr-1950)}}$$
$$f_{phd}(yr) := random[0..20]$$
$$f_{masters}(yr) := random[0..10]$$
$$f_{www}(yr) := random[0..10]$$
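
The logistic approximations can be evaluated directly. The following Python sketch is a minimal illustration that uses the parameters given for JOURNAL, ARTICLE, PROCEEDINGS, and INPROCEEDINGS documents; the helper name f_logistic and the way the year offset is passed are our own choices.

```python
import math


def f_logistic(x: float, a: float, b: float, c: float) -> float:
    """Logistic curve a / (1 + b * e^(-c * x)) with upper asymptote a."""
    return a / (1.0 + b * math.exp(-c * x))


def f_journal(yr: int) -> float:
    return f_logistic(yr - 1950, a=740.43, b=426.28, c=0.12)


def f_article(yr: int) -> float:
    return f_logistic(yr - 1950, a=58519.12, b=876.80, c=0.12)


def f_proc(yr: int) -> float:
    return f_logistic(yr - 1965, a=5502.31, b=1250.26, c=0.14)


def f_inproc(yr: int) -> float:
    return f_logistic(yr - 1965, a=337132.34, b=1901.05, c=0.15)


for yr in (1970, 1990, 2005):
    print(yr, round(f_proc(yr)), round(f_inproc(yr)))

# Average number of inproceedings per proceedings volume in a given year.
ratio_2005 = f_inproc(2005) / f_proc(2005)
```

For 2005 the ratio of the two curves evaluates to roughly 60, in line with the observation that there are about 50-60 times more inproceedings than proceedings.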

C. Authors and Editors 

Based on the previous analysis, we can estimate the number 
of documents $f_{docs}$ in yr by summing up the individual counts: 

$$f_{docs}(yr) := f_{journal}(yr) + f_{article}(yr) + f_{proc}(yr) + f_{inproc}(yr) + f_{incoll}(yr) + f_{book}(yr) + f_{phd}(yr) + f_{masters}(yr) + f_{www}(yr).$$



The total number of authors, which we define as the number 
of author attributes in the data set, is computed as follows. 
First, we estimate the number of documents described by 
attribute author for each document class individually (using the 
distribution in Table I). All these counts are summed up, which 
gives an estimate of the total number of documents with 
one or more author attributes. Finally, this value is multiplied 
with the expected average number of authors per paper in the 
respective year (implicitly given by $d_{auth}$ in Section III-A). 

To be close to reality, we also consider the number of 
distinct persons that appear as authors (per year), called 
distinct authors, and the number of new authors in a given 
year, i.e. those persons that publish for the first time. 

We found that the number of distinct authors $f_{dauth}$ per 
year can be expressed in terms of $f_{auth}$ as follows. 

$$f_{dauth}(yr) := \left(\frac{-0.67}{1 + 169.41\,e^{-0.07(yr-1936)}} + 0.84\right) * f_{auth}(yr)$$

The equation above indicates that the number of distinct 
authors relative to the total authors decreases steadily, from 
0.84 to 0.84 - 0.67 = 0.17 (i.e. from 84% to 17%). Among others, this reflects 
the increasing productivity of authors over time. 
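
The behavior of the factor in front of the total author count can be checked numerically. The Python sketch below evaluates only this factor, with the constants copied from the formula for the distinct authors; the function name is ours.

```python
import math


def distinct_author_ratio(yr: int) -> float:
    """Share of distinct authors among all author attributes in year yr."""
    return -0.67 / (1.0 + 169.41 * math.exp(-0.07 * (yr - 1936))) + 0.84


for yr in (1936, 1960, 1980, 2005):
    print(yr, round(distinct_author_ratio(yr), 3))
```

The value starts just below 0.84 for early years and approaches 0.17 from above as yr grows, so the two bounds stated in the text are the asymptotes of the curve rather than values reached within the observed period.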

The formula for the number $f_{new}$ of new authors builds on 
the previous one and is likewise based on a logistic curve: 

$$f_{new}(yr) := \left(\frac{\ldots}{1 + 1749.00\,e^{-0.14(yr-1937)}} + 0.628\right) * f_{dauth}(yr)$$

Publications. In Figure 2(c) we plot, for selected years and 
publication count x, the number of authors with exactly x 
publications in this year. The graph is in log-log scale. We 
observe a typical power law distribution, i.e. there are only a 
couple of authors having a large number of publications, while 
lots of authors have only few publications. 

Power law distributions are modeled by functions of the 
form $f_{powerlaw}(x) = a x^{k} + b$, with constants $a \in \mathbb{R}_{>0}$, 
exponent $k \in \mathbb{R}_{<0}$, and $b \in \mathbb{R}$. Parameter a affects the x-axis 
intercept, exponent k defines the gradient, and b constitutes 
a shift in y-direction. For the given parameter restriction, the 
functions decrease steadily for increasing x > 0. 
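
The role of the exponent as a gradient becomes visible in log-log scale, where a power law without vertical shift is a straight line with slope k. The following Python sketch demonstrates this with hypothetical parameters (a = 1000, k = -2.5, b = 0), which are not fitted to DBLP.

```python
import math


def f_powerlaw(x: float, a: float, k: float, b: float) -> float:
    """Power law a * x^k + b, defined here for x > 0, a > 0, and k < 0."""
    if x <= 0 or a <= 0 or k >= 0:
        raise ValueError("expected x > 0, a > 0 and k < 0")
    return a * x ** k + b


# Hypothetical parameters for illustration only.
A, K, B = 1000.0, -2.5, 0.0
counts = [f_powerlaw(x, A, K, B) for x in range(1, 11)]

# In log-log scale the curve is a straight line whose slope equals k.
slope = (math.log(counts[9]) - math.log(counts[0])) / (math.log(10) - math.log(1))
print(round(slope, 6))  # -2.5
```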

Figure 2(c) shows that, throughout the years, the curves 
move upwards. This means that the publication count of the 
leading author(s) has steadily increased over the last 30 years, 
and also reflects an increasing number of authors. We estimate 
the number of authors with x publications in year yr as 

$$f_{awp}(x, yr) := 1.50\, f_{publ}(yr)\, x^{-f'_{awp}(yr)} - 5, \text{ where}$$
$$f'_{awp}(yr) := \frac{\ldots}{1 + 216223\,e^{-0.20(yr-1936)}} + \ldots, \text{ and}$$

$f_{publ}(yr)$ returns the total number of publications in yr. 

Coauthors. In analyzing coauthor characteristics, we inves- 
tigated relations between the publication count of authors and 
the number of their total and distinct coauthors. Given a number 
x of publications, we (roughly) estimate the average number 
of total coauthors by $\mu_{coauth} := 2.12 * x$ and the number of 
distinct coauthors by $\mu_{dcoauth} := x^{0.81}$. We take these values 
into consideration when assigning coauthors. 
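
A small numeric example may help to read the two coauthor estimates. The following Python sketch evaluates them for a few publication counts; the function names are ours.

```python
def expected_total_coauthors(publications: int) -> float:
    """Rough estimate of the total number of coauthors: 2.12 * x."""
    return 2.12 * publications


def expected_distinct_coauthors(publications: int) -> float:
    """Rough estimate of the number of distinct coauthors: x^0.81."""
    return publications ** 0.81


for x in (1, 10, 100):
    print(x, round(expected_total_coauthors(x), 2), round(expected_distinct_coauthors(x), 2))
```

The total grows linearly while the distinct count grows sublinearly, so under these estimates prolific authors increasingly publish with people they have worked with before.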

Editors. The analysis of authors is complemented by a 
study of their relations to editors. We associate editors with 
authors by investigating the editors' number of publications 
in (earlier) venues. As one might expect, editors often have 
published before, i.e. are persons that are known in the com- 
munity. The concrete formula is rather technical and omitted. 

D. Citations 

In Section III-A we have studied repeated occurrences of 
attribute cite, i.e. outgoing citations. Concerning the incoming 
citations (i.e. the count of incoming references for papers), we 
observed a characteristic power law distribution: Most papers 
have few incoming citations, while only few are cited often. 
We omit the concrete power law approximation function. 

We also observed that the number of incoming citations 
is smaller than the number of outgoing citations. This is 
because DBLP contains many untargeted citations (i.e. empty 
cite tags). Recalling that only a fraction of all papers have 
outgoing citations (cf. Section III-A), we conclude that the 
DBLP citation system is very incomplete. 

IV. Data Generation 

The RDF Data Model. From a logical point of view, RDF 
databases are collections of so-called triples of knowledge. 
A triple (subject,predicate,object) models the binary relation 
predicate between subject and object and can be visualized in 
a directed graph by an edge from the subject node to an object 
node under label predicate. Figure 3(b) shows a sample RDF 
graph, where dashed lines represent edges that are labeled with 
rdf:type, and sc is an abbreviation for rdfs:subClassOf. For 
instance, the arc from node Proceeding1 to node _:John_Due 
represents the triple (Proceeding1,swrc:editor,_:John_Due). 
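
The correspondence between triples and labeled edges can be sketched in a few lines of Python. Only the swrc:editor triple is taken from the example above; the class URI bench:Proceedings and the foaf:name literal are assumptions added for illustration.

```python
from collections import defaultdict

Triple = tuple[str, str, str]

triples: list[Triple] = [
    ("Proceeding1", "swrc:editor", "_:John_Due"),      # example from the text
    ("Proceeding1", "rdf:type", "bench:Proceedings"),  # hypothetical class URI
    ("_:John_Due", "foaf:name", '"John Due"'),         # hypothetical literal
]

# Each triple becomes one edge: subject --predicate--> object.
edges: dict[str, list[tuple[str, str]]] = defaultdict(list)
for subject, predicate, obj in triples:
    edges[subject].append((predicate, obj))


def node_kind(term: str) -> str:
    """Classify a term by its surface form in this toy encoding."""
    if term.startswith("_:"):
        return "blank node"
    if term.startswith('"'):
        return "literal"
    return "URI"


def objects(subject: str, predicate: str) -> list[str]:
    """Follow all edges labeled `predicate` that leave `subject`."""
    return [o for (p, o) in edges[subject] if p == predicate]


print(objects("Proceeding1", "swrc:editor"))  # ['_:John_Due']
print(node_kind("_:John_Due"))                # blank node
```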

RDF graphs may contain three types of nodes. First, URIs 
(Uniform Resource Identifiers) are strings that uniquely iden- 
tify abstract or physical resources, such as conferences or 
journals. Blank nodes have an existential character, i.e. are 
typically used to denote resources that exist, but are not 
assigned a fixed URI. We represent URIs and blank nodes by 
ellipses, identifying blank nodes by the prefix "_:". Literals 
represent (possibly typed) values and usually describe URIs 
or blank nodes. Literals are represented by quoted strings. 



The RDF standard [1] introduces a base vocabulary with 
fixed semantics, e.g. defines URI rdf:type for type specifica- 
tions. This vocabulary also includes containers, such as bags 
or sequences. RDFS [25] extends the RDF vocabulary and,
among others, provides URIs for subclass (rdfs:subClassOf)
and subproperty (rdfs:subPropertyOf) specifications. On top of
RDF and RDFS, one can easily create user-defined, domain- 
specific vocabularies. Our data generator makes heavy use of 
such predefined vocabulary collections. 

The DBLP RDF Scheme. Our RDF scheme basically
follows the approach in [31], which presents an XML-to-RDF
mapping of the original DBLP data. However, we want to
generate arbitrarily-sized documents and provide lists of first
and last names, publishers, and random words to our data
generator. Conference and journal names are always of the
form "Conference $i ($year)" and "Journal $i ($year)", where
$i is a unique conference (resp. journal) number in year $year.

Similar to [31], we use existing RDF vocabularies to
describe resources in a uniform way. We borrow vocabulary from
FOAF 9 for describing persons, and from SWRC 10 and DC 11
for describing scientific resources. Additionally, we introduce
a namespace bench, which defines DBLP-specific document
classes, such as bench:Book and bench:Article.
Figure 3(a) shows the translation of attributes to RDF properties.
For each attribute, we also list its range restriction, i.e. the type
of elements it refers to. For instance, attribute author is mapped
to dc:creator, and references objects of type foaf:Person.

The original DBLP RDF scheme contains neither blank
nodes nor RDF containers. As we want to test our queries on
top of such RDF-specific constructs, we use (unique) blank
nodes "_:givenname_lastname" for persons (instead of URIs)
and model outgoing citations of documents using standard
rdf:Bag containers. We also enriched a small fraction of
Article and Inproceedings documents with the new property
bench:abstract (about 1%, keeping the modification low),
which constitutes comparably large strings (using a Gaussian
distribution with μ = 150 expected words and σ = 30).

Figure 3(b) shows a sample DBLP instance. On the logical
level, we distinguish between the schema layer (gray) and
the instance layer (white). Reference lists are modeled as
blank nodes of type rdf:Bag, i.e. using standard RDF
containers (see node _:references1). Authors and editors are
represented by blank nodes of type foaf:Person. Class
foaf:Document splits up into the individual document
classes bench:Journal, bench:Article, and so on.
Our graph defines three persons, one proceeding, two
inproceedings, one journal, and one article. For readability
reasons, we plot only selected predicates. As also illustrated,
property dcterms:partOf links inproceedings and proceedings
together, while swrc:journal connects articles to their journals.

In order to provide an entry point for queries that access 
authors and to provide a person with fixed characteristics, we 

9 http://www.foaf-project.org/
10 http://ontoware.org/projects/swrc/
11 http://dublincore.org/



attribute    mapped to prop.      refers to

address      swrc:address         xsd:string
author       dc:creator           foaf:Person
booktitle    bench:booktitle      xsd:string
cdrom        bench:cdrom          xsd:string
chapter      swrc:chapter         xsd:integer
cite         dcterms:references   foaf:Document
crossref     dcterms:partOf       foaf:Document
editor       swrc:editor          foaf:Person
ee           rdfs:seeAlso         xsd:string
isbn         swrc:isbn            xsd:string
journal      swrc:journal         bench:Journal
month        swrc:month           xsd:integer
note         bench:note           xsd:string
number       swrc:number          xsd:integer
page         swrc:pages           xsd:string
publisher    dc:publisher         xsd:string
school       dc:publisher         xsd:string
series       swrc:series          xsd:integer
title        dc:title             xsd:string
url          foaf:homepage        xsd:string
volume       swrc:volume          xsd:integer
year         dcterms:issued       xsd:integer





Fig. 3. (a) Translations of attributes, and (b) DBLP sample instance in RDF format 



foreach year:
    calculate counts for and generate document classes;
    calculate nr of total, new, distinct, and retiring authors;
    choose publishing authors;
    assign nr of new publications, nr of coauthors, and
        nr of distinct coauthors to publishing authors;
        // s.t. constraints for nr of publications/author hold
    assign from publishing authors to papers;
        // satisfying authors per paper/coauthors constraints
    choose editors and assign editors to papers;
        // s.t. constraints for nr of publications/editors hold
    generate outgoing citations;
    assign expected incoming/outgoing citations to papers;
    write output until done or until output limit reached;
        // permanently keeping output consistent



Fig. 4. Data generation algorithm 



created a special author, named after the famous mathematician
Paul Erdős. Per year, we assign 10 publications and 2
editor activities to this prominent person, starting from year
1940 up to 1996. For ease of access, Paul Erdős is modeled
by a fixed URI. As an example query consider Q8, which
extracts all persons with Erdős Number 12 1 or 2.

Data Generation. Our data generator was implemented in 
C++. It takes into account all relationships and characteristics
that have been studied in Section III. Figure 4 shows the
key steps in data generation. We simulate year by year and 
generate data according to the structural constraints in a 
carefully selected order. As a consequence, data generation 
is incremental, i.e. small documents are always contained in 
larger documents. 

The generator offers two parameters, to fix either a triple 
count limit or the year up to which data will be generated. 
When the triple count limit is set, we make sure to end up 
in a "consistent" state, e.g. whenever proceedings are written, 
the corresponding conference will be included. 

The generation process is simulation-based. Among others, 
this means that we assign life times to authors, and individually 
estimate their future behavior, taking into account global 
publication and coauthor characteristics, as well as the fraction 
of distinct and new authors (cf. Section III-C).

All random functions (which, for example, are used to assign
the attributes according to Table I) are based on a fixed seed.
This makes data generation deterministic, i.e. the parameter 
setting uniquely identifies the outcome. As data generation is 
also platform-independent, we ensure that experimental results 
from different machines are comparable. 
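
The two properties stated above (a fixed seed makes the output a function of the parameters, and small documents are contained in larger ones) can be illustrated with a minimal sketch. This is not the C++ generator described in this section: the function names `generate_triples` and `document`, the seed value 42, and the toy triple layout are hypothetical and chosen only for illustration. The sketch also does not model the "consistent state" guarantee of the real generator.

```python
import random
from itertools import islice


def generate_triples(seed=42):
    """Yield an endless, deterministic stream of toy triples, year by year."""
    rng = random.Random(seed)  # private generator with a fixed seed
    year = 1940
    while True:
        journals_this_year = rng.randint(1, 3)  # toy stand-in for class counts
        for i in range(1, journals_this_year + 1):
            yield (f"Journal{i}_{year}", "dcterms:issued", year)
        year += 1


def document(triple_limit, seed=42):
    """Cut the stream off at a triple count limit."""
    return list(islice(generate_triples(seed), triple_limit))


small = document(10)
large = document(50)
```

Because both documents are prefixes of the same seeded stream, `large[:10] == small` holds, and calling `document(10)` twice yields identical output.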

V. Benchmark Queries 

The SPARQL Query Language. SPARQL is a declarative
language based on a powerful graph matching facility,
which allows matching query subexpressions against the RDF input
12 See http://www.oakland.edu/enp/.



graph. The very basic SPARQL constructs are triple patterns 
(subject, predicate, object), where variables might be used in 
place of fixed values for each of the three components. In 
evaluating SPARQL, these patterns are mapped against one 
or more input graphs, thereby binding variables to matching 
nodes or edges in the graph(s). Since we are primarily inter- 
ested in database aspects, such as operator constellations and 
access patterns, we focus on queries that access a single graph. 

The SPARQL standard [3] defines four distinct query forms.
SELECT queries retrieve all possible variable-to-graph
mappings, while ASK queries return yes if at least one such
mapping exists, and no otherwise. The DESCRIBE form extracts
additional information related to the result mappings (e.g.
adjacent nodes), while CONSTRUCT transforms the result mapping
into a new RDF graph. The most appropriate for our purpose is
SELECT, which best reflects SPARQL core evaluation. ASK
queries are also interesting, as they might affect the choice
of the query execution plan (QEP). In contrast, CONSTRUCT
and DESCRIBE build upon the core evaluation of SELECT,
i.e. transform its result in a post-processing step. This step



TABLE II
Selected properties of the benchmark queries; shortcuts are indicated by bold font

Query: 1 2 3abc 4 5ab 6 7 8 9 10 11 12c

1 Operators: And, Filter, Union, Optional
    A  A,O  A,F  A,F  A,F  A,F,O  A,F,O  A,F,U  A,U
2 Modifiers: Distinct, Limit, Offset, Order by
    Ob  -  D  D  D  D  D  -  L,Ob,Of
4 Filter Pushing Possible?
    ✓  -  ✓/-  ✓  ✓  ✓
5 Reusing of Graph Pattern Possible?
    ✓  -  ✓  ✓  ✓  ✓
6 Data Access: Blank nodes, Literals, URIs, Large Literals, Containers
    L,U  L,U,La  L,U  B,L,U  B,L,U  B,L,U  L,U,C  B,L,U  B,L,U  U  L,U



is not very challenging from a database perspective, so we
focus on SELECT and ASK queries (though, on demand, these
queries could easily be translated into the other forms).

The most important SPARQL operator is AND (denoted
as "."). If two SPARQL expressions A and B are connected
by AND, the result is computed by joining the result mappings
of A and B on their shared variables [4]. Let us consider Q1
from the Appendix, which defines three triple patterns
interconnected through AND. When first evaluating the patterns
individually, variable ?journal is bound to nodes with (1) edge
rdf:type pointing to the URI bench:Journal, (2) edge
dc:title pointing to the literal "Journal 1 (1940)" of type string,
and (3) edge dcterms:issued, respectively. The next step is to
join the individual mapping sets on variable ?journal. The
result then contains all mappings from ?journal to nodes that
satisfy all three patterns. Finally, SELECT projects for variable
?yr, which has been bound in the third pattern.
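
The join on shared variables can be sketched in a few lines of Python. This is a naive nested-loop illustration, not the evaluation strategy of any particular engine; the helper names `compatible` and `and_join` and the node identifiers in the three mapping sets are hypothetical stand-ins for the per-pattern results of Q1.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on every shared variable."""
    return all(m2[var] == val for var, val in m1.items() if var in m2)


def and_join(left, right):
    """Join two sets of mappings on their shared variables (operator AND)."""
    return [{**m1, **m2} for m1 in left for m2 in right if compatible(m1, m2)]


# Hypothetical mapping sets, one per triple pattern of Q1.
is_journal = [{"?journal": "Journal1_1940"}, {"?journal": "Journal2_1941"}]
has_title = [{"?journal": "Journal1_1940"}]
issued = [
    {"?journal": "Journal1_1940", "?yr": 1940},
    {"?journal": "Journal2_1941", "?yr": 1941},
]

joined = and_join(and_join(is_journal, has_title), issued)
result = [m["?yr"] for m in joined]  # projection performed by SELECT ?yr
```

Only the mapping for the first journal survives all three patterns, so `result` holds the single year 1940. Mappings without shared variables are always compatible, in which case the join degenerates to a cross product.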

Other SPARQL operators are UNION, OPTIONAL, and FILTER,
akin to relational unions, left outer joins, and selections,
respectively. Due to space limitations, we omit an explanation
of these constructs and refer the reader to the SPARQL 
semantics [3]. Beyond all these operators, SPARQL provides 
functions to be used in FILTER expressions, e.g. for reg- 
ular expression testing. We expect these functions to only 
marginally affect engine performance, since their implemen- 
tation is mostly straightforward (or might be realized through 
efficient libraries). They are unlikely to bring insights into the 
core evaluation capabilities, so we omit them intentionally. 
This decision also facilitates benchmarking of research proto- 
types, which typically do not implement the full standard. 

The SP2Bench queries also cover SPARQL solution modifiers,
such as DISTINCT, ORDER BY, LIMIT, and OFFSET.
Like their SQL counterparts, they might heavily affect the
choice of an efficient QEP, so they are relevant for our
benchmark. We point out that the previous discussion captures
virtually all key features of the SPARQL query language. In
particular, SPARQL (currently) does not support aggregation,
nesting, or recursion.

SPARQL Characteristics. Rows 1 and 2 in Table II survey
the operators used in the SELECT benchmark queries (the
ASK queries Q12a and Q12b share the characteristics of their
SELECT counterparts Q5a and Q8, respectively, and are not
shown). The queries cover various operator constellations,
combined with selected solution modifier combinations.

One very characteristic SPARQL feature is the operator
OPTIONAL. An expression A OPTIONAL {B} joins result
mappings from A with mappings from B, but, unlike AND,



retains mappings from A for which no join partner in B is 
present. In the latter case, variables that occur only inside B 
might be unbound. By combining OPTIONAL with FILTER and 
function BOUND, which checks if a variable is bound or not, 
one can simulate closed world negation in SPARQL. Many
interesting queries involve such an encoding (cf. Q6 and Q7).
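
The encoding of closed world negation through OPTIONAL, FILTER, and BOUND can be made concrete with a small sketch. It treats OPTIONAL as a left outer join over mapping sets and then filters on unbound variables. The helper names `compatible`, `optional`, and `bound`, as well as the sample authors and papers, are hypothetical and serve only as an illustration.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on every shared variable."""
    return all(m2[var] == val for var, val in m1.items() if var in m2)


def optional(left, right):
    """Left outer join: every left mapping is kept, extended where possible."""
    out = []
    for m1 in left:
        partners = [{**m1, **m2} for m2 in right if compatible(m1, m2)]
        # Without a join partner, the left mapping survives unchanged.
        out.extend(partners if partners else [dict(m1)])
    return out


def bound(mapping, var):
    """Counterpart of the BOUND function: is the variable bound?"""
    return var in mapping


# Hypothetical data: all authors, and earlier papers of some of them.
authors = [{"?author": "alice"}, {"?author": "bob"}]
earlier = [{"?author": "alice", "?earlier": "paper0"}]

extended = optional(authors, earlier)
# FILTER (!bound(?earlier)) keeps authors WITHOUT an earlier paper.
no_earlier = [m for m in extended if not bound(m, "?earlier")]
```

Here `no_earlier` contains only the mapping for "bob", i.e. the author for whom the optional part found no join partner.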

SPARQL operates on graph-structured data, thus engines
should perform well on different kinds of graph patterns.
Unfortunately, to date only few real-world SPARQL scenarios
exist. It would be necessary to analyze a large
set of such scenarios to extract graph patterns that frequently
occur in practice. In the absence of this possibility, we
distinguish between long path chains, i.e. nodes linked to
other nodes via a long path, bushy patterns, i.e. single nodes
that are linked to a multitude of other nodes, and combinations
of these two patterns. Since it is impossible to give a precise
definition of "long" and "bushy", we designed meaningful
queries that contain comparably long chains (i.e. Q4, Q6) and
comparably bushy patterns (i.e. Q2) w.r.t. our scenario. These
patterns contribute to the variety of characteristics that we cover.

SPARQL Optimization. Our objective is to design queries 
that are amenable to a variety of SPARQL optimization 
approaches. To this end, we discuss possible optimization 
techniques before presenting the benchmark queries. 

A promising approach to SPARQL optimization is the re- 
ordering of triple patterns based on selectivity estimation [5], 
akin to relational join reordering. Closely related to triple 
reordering is FILTER pushing, which aims at an early
evaluation of filter conditions, similar to selection pushing in
Relational Algebra. Both techniques might speed up evaluation
by decreasing the size of intermediate results. An efficient join 
order depends on selectivity estimations for triple patterns, 
but might also be affected by available data access paths. 
Join reordering might apply to most of our queries. Row 4
in Table II lists the queries that support FILTER pushing.

Another idea is to reuse evaluation results of triple patterns
(or even combinations thereof). This might be possible
whenever the same pattern is used multiple times. As an example
consider Q4. Here, ?article1 and ?article2 in the first and
second triple pattern will be bound to the same nodes. We
survey the applicability of this technique in Table II, row 5.

RDF Characteristics and Storage. SPARQL has been 
specifically designed to operate on top of RDF [1] rather 
than RDFS [25] data. Although it is possible to access RDFS 
vocabulary with SPARQL, the semantics of RDFS [26] is 
ignored when evaluating such queries. Consider for example 
the rdfs:subClassOf property, which is used to model sub- 



class relationships between entities, and assume that class 
Student is a subclass of Person. A SPARQL query like 
"Select all objects of type Person" then does not return 
students, although according to [26] each student is also a 
person. Hence, queries that cover RDFS inference make no 
sense unless the SPARQL standard is changed accordingly. 

Recalling that persons are modeled as blank nodes, all 
queries that deal with persons access blank nodes. Moreover, 
one of our queries operates on top of the RDF bag container 
for reference lists (Q7), and one accesses the comparably large 
abstract literals (Q2). Row 6 in Table II provides a survey.

A comparison of RDF storage strategies is provided in [12]. 
Storage scheme and indices finally imply a selection of effi- 
cient data access paths. Our queries impose varying challenges 
to the storage scheme, e.g. test data access through RDF 
subjects, predicates, objects, and combinations thereof. In most 
cases, predicates are fixed and subject and/or object vary, but 
we also test more uncommon access patterns. We will resume 
this discussion when describing Q9 and Q10. 

A. Benchmark Queries 

The benchmark queries also vary in general characteristics 
like selectivity, query and output size, and different types of 
joins. We will point out such characteristics in the subsequent 
individual discussion of the benchmark queries. 

In the following, we distinguish between in-memory en- 
gines, which load the document from file and process queries 
in main memory, and native engines, which rely on a physical 
database system. When discussing challenges to and evaluation 
strategies for native engines, we always assume that the 
document has already been loaded in the database before. 

We finally emphasize that in this paper we focus on the 
SPARQL versions of our queries, which can be processed 
directly by real SPARQL engines. One might also be interested 
in the SQL-translations of these queries available at the 
SP 2 Bench project page. We refer the interested reader to [33] 
for an elaborate discussion of these translations. 



Q1. Return the year of publication of "Journal 1 (1940)".

This simple query returns exactly one result (for arbitrarily
large documents). Native engines might use index lookups
in order to answer this query in (almost) constant time,
i.e. execution time should be independent of document size.



Q2. Extract all inproceedings with properties dc:creator,
bench:booktitle, dcterms:issued, dcterms:partOf, rdfs:seeAlso,
dc:title, swrc:pages, foaf:homepage, and optionally
bench:abstract, including their values.

This query implements a bushy graph pattern. It contains
a single, simple OPTIONAL expression, and accesses large
strings (i.e. the abstracts). Result size grows with database
size, and a final result ordering is necessary due to operator
ORDER BY. Both native and in-memory engines might reach
evaluation times that are almost linear in the document size.



Q3abc. Select all articles with property (a) swrc:pages, 
(b) swrc:month, or (c) swrc:isbn. 

This query tests FILTER expressions with varying selectivity. 
According to Table I, the FILTER expression in Q3a is not
very selective (i.e. retains about 92.61% of all articles). Data 
access through a secondary index for Q3a is probably not very 
efficient, but might work well for Q3b, which selects only 
0.65% of all articles. The filter condition in Q3c is never 
satisfied, as no articles have swrc:isbn predicates. Schema 
statistics might be used to answer Q3c in constant time. 

Q4. Select all distinct pairs of article author names for authors 
that have published in the same journal. 

Q4 contains a rather long graph chain, i.e. variables ?name1
and ?name2 are linked through the articles the authors have 
published, and a common journal. The result is very large, 
basically quadratic in number and size of journals. Instead 
of evaluating the outer pattern block and applying the FILTER 
afterwards, engines might embed the FILTER expression in the 
computation of the block, e.g. by exploiting indices on author 
names. The DISTINCT modifier further complicates the query. 
We expect superlinear behavior, even for native engines. 

Q5ab. Return the names of all persons that occur as author 
of at least one inproceeding and at least one article. 

Queries Q5a and Q5b test different variants of joins. Q5a 
implements an implicit join on author names, which is encoded 
in the FILTER condition, while Q5b explicitly joins the authors 
on variable name. Although in general the queries are not 
equivalent, the one-to-one mapping between authors and their 
names (i.e. author names constitute primary keys) in our 
scenario implies equivalence. In [14], semantic optimization 
on top of such keys for RDF has been proposed. Such an 
approach might detect the equivalence of both queries in this 
scenario and select the more efficient variant. 



Q6. Return, for each year, the set of all publications authored 
by persons that have not published in years before. 

Q6 implements closed world negation (CWN), expressed 
through a combination of operators OPTIONAL, FILTER, and 
BOUND. The idea of the construction is that the block outside 
the OPTIONAL expression computes all publications, while 
the inner one constitutes earlier publications from authors 
that appear outside. The outer FILTER expression then retains 
publications for which ?author2 is unbound, i.e. exactly the 
publications of authors without publications in earlier years. 

Q7. Return the titles of all papers that have been cited at least 
once, but not by any paper that has not been cited itself. 

This query tests double negation, which requires nested CWN.
Recalling that the citation system of DBLP is rather
incomplete (cf. Section III-D), we expect only few results.
Nevertheless, the query is challenging due to the double
negation. Engines might reuse graph pattern results; for
instance, the block
?class[i] rdf:type foaf:Document. ?doc[i] rdf:type ?class[i].
occurs three times, for empty [i], [i]=3, and [i]=4.

Q8. Compute authors that have published with Paul Erdős or
with an author that has published with Paul Erdős.

Here, the evaluation of the second UNION part is basically 
"contained" in the evaluation of the first part. Hence, tech- 
niques like graph pattern (or subexpression) reusing might 
apply. Another very promising optimization approach is to de- 
compose the filter expressions and push down its components, 
in order to decrease the size of intermediate results. 



TABLE III
Document generation evaluation

#triples           10^3   10^4   10^5   10^6   10^7   10^8   10^9
elapsed time [s]   0.08   0.13   0.60   5.76   70     1011   13306



VI. Experiments 

All experiments were conducted under Ubuntu Linux v7.10
(Gutsy), on top of an Intel Core2 Duo E6400 2.13GHz CPU and
3GB DDR2 667 MHz non-ECC physical memory. We used a
250GB Hitachi P7K500 SATA-II hard drive with 8MB cache.
The Java engines were executed with JRE v1.6.0_04.

A. Data Generator 

To prove the practicability of our data generator, we mea- 
sured data generation times for documents of different sizes. 
Table III shows the performance results for documents
containing up to one billion triples. The generator scales almost
linearly with document size and creates even large documents
very fast (the 10^9 triples document has a physical size of
about 103GB). Moreover, it runs with constant main memory
consumption (i.e., gets by with about 1.2GB RAM).

We verified the implementation of all characteristics from
Section III. Table VIII shows selected data generator and
output document characteristics for documents up to 25M
triples. We list the size of the output file, the year in which
simulation ended, the number of total authors and distinct
authors contained in the data set (cf. Section III-C), and the
counts of the document class instances (cf. Section III-B). We
observe superlinear growth for the number of authors (w.r.t. the
number of triples). This is primarily caused by the increasing
average number of authors per paper (cf. Section III-A).
The growth rate of proceedings and inproceedings is also
superlinear, while the rate of journals and articles is sublinear.
These observations reflect the yearly document class counts
in Figure 2(b). We remark that, like in the original DBLP
database, in the early years instances of several document
classes are missing, e.g. there are no Book and Www
documents. Also note that the counts of inproceedings and
articles clearly dominate the remaining document classes.

Table V surveys the result sizes for the queries on
documents up to 25M triples. We observe for example that the
outcome of Q3a, Q3b, and Q3c reflects the selectivities of
their FILTER attributes swrc:pages, swrc:month, and swrc:isbn
(cf. Tables I and VIII). We will come back to the result size
listing when discussing the benchmark results later on.

B. Benchmark Metrics 

Depending on the scenario, we will report on user time
(usr), system time (sys), and the high watermark of resident
memory consumption (rmem). These values were extracted
from the proc file system, whereas we measured elapsed time
(tme) through timers. It is important to note that experiments
were carried out on a dual-core CPU, where the Linux kernel
sums up the usr and sys times of the individual processor
units. As a consequence, in some scenarios the sum usr+sys
might be greater than the elapsed time tme.

We propose several metrics that capture different aspects 
of the evaluation. Reports of the benchmark results would, 
in the best case, include all these metrics, but might also 
ignore metrics that are irrelevant to the underlying scenario. 
We propose to perform three runs over documents comprising 



Q9. Return incoming and outgoing properties of persons. 

Q9 has been designed to test non-standard data access
patterns. Naive implementations would compute the triple
patterns of the UNION subexpressions separately, thus
evaluate patterns where no component is bound. Then, pattern
?subject ?predicate ?person would select all graph
triples, which is rather inefficient. Another idea is to evaluate
the first triple in each UNION subexpression, afterwards using 
the bindings for variable ?person to evaluate the second pat- 
terns more efficiently. In this case, we observe patterns where 
only the subject (resp. the object) is bound. Also observe 
that this query extracts schema information. The result size 
is exactly 4 (for sufficiently large documents). Statistics about 
incoming/outgoing properties of Person-typed objects in native 
engines might be used to answer this query in constant time, 
even without data access. In-memory engines always must load 
the document, hence might scale linearly to document size. 



Q10. Return all subjects that stand in any relation to person
"Paul Erdős". In our scenario the query can be reformulated
as Return publications and venues in which "Paul Erdős" is
involved either as author or as editor.

Q10 implements an object bound-only access pattern. In
contrast to Q9, statistics are not immediately useful, since the
result includes subjects. Recall that "Paul Erdős" is active only
between 1940 and 1996, so result size stabilizes for sufficiently
large documents. Native engines might exploit indices and
reach (almost) constant execution time.



Q11. Return (up to) 10 electronic edition URLs starting from
the 51st publication, in lexicographical order.

This query focuses on the combination of solution modifiers
ORDER BY, LIMIT, and OFFSET. In-memory engines have to
read, process, and sort electronic editions prior to processing
LIMIT and OFFSET. In contrast, native engines might exploit
indices to access only a fraction of all electronic editions and,
as the result is limited to 10, reach constant runtimes.
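
The semantics of the three modifiers in Q11 can be sketched as follows: "starting from the 51st publication" corresponds to skipping 50 results after sorting. The first function sorts everything, as described above for in-memory engines. The second one is an alternative that is not taken from this paper: it uses a bounded selection (`heapq.nsmallest`) to avoid a full sort, and it is not the index-based strategy mentioned for native engines. Function names and the example URLs are hypothetical.

```python
import heapq
import random


def order_limit_offset(values, limit, offset):
    """Reference semantics: full sort, then skip `offset`, keep `limit`."""
    return sorted(values)[offset:offset + limit]


def order_limit_offset_bounded(values, limit, offset):
    """Same result, but only the offset+limit smallest values are ordered."""
    return heapq.nsmallest(offset + limit, values)[offset:]


# Hypothetical electronic edition URLs in random input order.
urls = [f"http://example.org/ee/{i:04d}" for i in range(200)]
random.Random(0).shuffle(urls)

page = order_limit_offset(urls, limit=10, offset=50)
page_bounded = order_limit_offset_bounded(urls, limit=10, offset=50)
```

Both calls return the same ten URLs, ranging from `.../ee/0050` to `.../ee/0059`. The bounded variant still has to read every input value once, so it only reduces the sorting effort, not the data access.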



Q12. (a) Return yes if a person occurs as author of at least
one inproceeding and article, no otherwise; (b) Return yes if
an author has published with "Paul Erdős" or with an author
that has published with "Paul Erdős", and no otherwise; (c)
Return yes if person "John Q. Public" is present in the database.

Q12a and Q12b share the properties of their SELECT
counterparts Q5a and Q8, respectively. They always return yes
for sufficiently large documents. When evaluating such ASK
queries, engines should break as soon as a solution has been
found. They might adapt the QEP to efficiently locate a
witness. For instance, based on execution time estimations it
might be favorable to evaluate the second part of the UNION in
Q12b first. Both native and in-memory engines should answer
these queries very fast, independent of document size.
Q12c asks for a single triple that is not present in the
database. With indices, native engines might execute this query
in constant time. Again, in-memory engines must scan (and
hence, load) the whole document.
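
The difference between an ASK query that finds a witness early and one that has no witness at all (as in Q12c) can be illustrated with a lazy scan. This is a minimal sketch over an in-memory list of triples, without any index; the names `CountingScan`, `matches`, and `ask`, and the sample triples, are hypothetical.

```python
class CountingScan:
    """Wrap a triple list and count how many triples are actually read."""

    def __init__(self, triples):
        self.triples = triples
        self.scanned = 0

    def __iter__(self):
        for triple in self.triples:
            self.scanned += 1
            yield triple


def matches(triples, pattern):
    """Lazily yield triples matching (s, p, o); None acts as a variable."""
    for triple in triples:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple


def ask(triples, pattern):
    """Return True as soon as the first witness is found."""
    return any(True for _ in matches(triples, pattern))


data = [(f"paper{i}", "dc:creator", f"_:author{i}") for i in range(1000)]

hit_scan = CountingScan(data)
hit = ask(hit_scan, (None, "dc:creator", None))

miss_scan = CountingScan(data)
miss = ask(miss_scan, (None, None, "_:John_Q_Public"))
```

In the first case a single triple is read before the answer is known, whereas the second case reads all 1000 triples before it can answer no.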



10k, 50k, 250k, 1M, 5M, and 25M triples, using a fixed
timeout of 30min per query and document, always reporting
on the average value over all three runs and, if significant, the
errors within these runs. We point out that this setting can be
evaluated in reasonable time (typically within a few days). If
the implementation is fast enough, nothing prevents the user
from adding larger documents. All reports should, of course,
include the hardware and software specifications. Performance
results should list tme, and optionally usr and sys. In the
following, we briefly describe a set of interesting metrics.

1) Success Rate: We propose to separately report on 
the success rates for the engine on top of all document 
sizes, distinguishing between Success, Timeout (e.g. an 
execution time > 30min as used in our experiments
here), Memory Exhaustion (if an additional memory 
limit was set), and general Errors. This metric gives a 
good survey over scaling properties and might give first 
insights into the behavior of engines. 

2) Loading Time: The user should report on the loading 
times for the documents of different sizes. This metric 
primarily applies to engines with a database backend and 
might be ignored for in-memory engines, where loading 
is typically part of the evaluation process. 

3) Per-Query Performance: The report should include 
the individual performance results for all queries over all 
document sizes. This metric is more detailed than the 
Success Rate report and forms the basis for a deep 
study of the results, in order to identify strengths and 
weaknesses of the tested implementation. 

4) Global Performance: We propose to combine the 
per-query results into a single performance measure. 
Here we recommend listing both the arithmetic and the
geometric mean of the execution times; the latter is
defined as the nth root of the product of n numbers.
In the context of SP2Bench, this means we multiply
the execution times of all 17 queries (queries that failed
should be ranked with 3600s, to penalize timeouts and
other errors) and compute the 17th root of this product
(for each document size, accordingly). This metric is
well-suited to compare the performance of engines. 

5) Memory Consumption: In particular for engines with 
a physical backend, the user should also report on 
the high watermark of main memory consumption and 
ideally also the average memory consumption over all 
queries (cf. Table EH and ED. 
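
The Global Performance metric (item 4 above) can be computed as in the following sketch. The 3600s penalty for failed queries is taken from the text; the function name `global_performance`, the use of `None` to mark a failed query, and the computation via logarithms (which is mathematically equal to the nth root of the product but avoids very large intermediate values) are assumptions of this illustration. All execution times are assumed to be strictly positive.

```python
import math

PENALTY_SECONDS = 3600.0  # rank assigned to timeouts and other failures


def global_performance(times):
    """Return (arithmetic mean, geometric mean) of per-query times.

    `times` holds one execution time in seconds per query; a failed
    query is represented by None and ranked with the penalty value.
    """
    ranked = [PENALTY_SECONDS if t is None else t for t in times]
    arithmetic = sum(ranked) / len(ranked)
    # exp(mean(log t)) equals the n-th root of the product of all t.
    geometric = math.exp(sum(math.log(t) for t in ranked) / len(ranked))
    return arithmetic, geometric


arith, geo = global_performance([1.0, 100.0])
arith_fail, geo_fail = global_performance([None, 1.0])
```

For the two-query example, the arithmetic mean is 50.5 and the geometric mean is 10, which shows how the geometric mean dampens the influence of a single slow query. For a full benchmark run, the list would contain 17 entries per document size.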

C. Benchmark Results for Selected Engines 

It is beyond the scope of this paper to provide an in-depth
comparison of existing SPARQL engines. Instead,
we use our metrics to give first insights into the state of the
art and illustrate by example that the benchmark indeed gives
valuable hints on bottlenecks in current implementations. In
this line, we are not primarily interested in concrete values
(which, however, might be of great interest in the general
case), but focus on the principal behavior and properties of
engines, e.g. discuss how they scale with document size. We



will exemplarily discuss some interesting cases and refer the 
interested reader to the Appendix for the complete results. 

We conducted benchmarks for (1) the Java engine ARQ 13
v2.2 on top of Jena 2.5.5, (2) the Redland 14 RDF Processor
v1.0.7 (written in C), using the Raptor Parser Toolkit
v1.4.16 and Rasqal Library v0.9.15, (3) SDB 15, which links
ARQ to an SQL database back-end (i.e., we used MySQL
v5.0.34), (4) the Java implementation Sesame 16 v2.2beta2,
and finally (5) OpenLink Virtuoso 17 v5.0.6 (written in C).

For Sesame we tested two configurations: SesameM, which
processes queries in memory, and SesameDB, which stores
data physically on disk, using the native Mulgara SAIL
(v1.3beta1). We thus distinguish between the in-memory
engines (ARQ, SesameM) and engines with physical backend,
namely (Redland, SDB, SesameDB, Virtuoso). The latter can
further be divided into engines with a native RDF store
(Redland, SesameDB, Virtuoso) and a relational database backend
(SDB). For all physical-backend databases we created indices
wherever possible (immediately after loading the documents)
and consider loading and execution time separately (index
creation time is included in the reported loading times).

We performed three cold runs over all queries and documents of
10k, 50k, 250k, 1M, 5M, and 25M triples, i.e. between consecutive
runs we restarted the engines and cleared the database. We set a
timeout of 30min (tme) per query and a memory limit of 2.6GB,
either using ulimit or restricting the JVM (for higher limits, the
initialization of the JVM failed). Negative and positive variation
of the average (over the runs) was < 2% in almost all cases, so we
omit error bars. For SDB and Virtuoso, which follow a
client-server architecture, we monitored both processes and summed
up their values.
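
The averaging over runs described above can be illustrated with a
minimal sketch. The run times below are hypothetical, and the helper
name summarize_runs is chosen for illustration only; it is not part of
the benchmark tooling.

```python
from statistics import mean

def summarize_runs(times):
    """Return the average over runs and the largest relative deviation from it."""
    avg = mean(times)
    max_rel_dev = max(abs(t - avg) for t in times) / avg
    return avg, max_rel_dev

# Hypothetical elapsed times (in seconds) of three cold runs of one query.
runs = [10.0, 10.1, 9.9]
avg, dev = summarize_runs(runs)
print(f"average = {avg:.2f}s, max deviation = {dev:.1%}")
```

A deviation below the 2% mark, as in this made-up example, is the
situation in which error bars add little information.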

We verified all results by comparing the outputs, observing that
SDB and Redland returned wrong results for a couple of queries, so
we restrict our discussion to the remaining four engines. Table IV
shows the success rates. All queries that are not listed
succeeded, except for ARQ and Sesame_M on the 25M document (either
due to timeout or memory exhaustion) and Virtuoso on Q6 (due to
missing standard compliance). Hence, Q4, Q5a, Q6, and Q7 are the
most challenging queries, where we observe many timeouts even for
small documents. Note that we did not succeed in loading the 25M
triples document into the Virtuoso database.

D. Discussion of Benchmark Results 

Main Memory. For the in-memory engines we observe that the high
watermark of main memory consumption during query evaluation grows
sublinearly with document size (cf. Table VI); e.g., for ARQ we
measured an average (over runs and queries) of 85MB on 10k, 166MB
on 50k, 318MB on 250k, 526MB on 1M, and 1.3GB on 5M triples.
Somewhat

¹³ http://jena.sourceforge.net/ARQ/
¹⁴ http://librdf.org/
¹⁵ http://jena.sourceforge.net/SDB/
¹⁶ http://www.openrdf.org/
¹⁷ http://www.openlinksw.com/virtuoso/



TABLE IV
Success rates for queries on RDF documents up to 25M triples. Queries are
encoded in hexadecimal (e.g., 'A' stands for Q10). We use the shortcuts
+:=Success, T:=Timeout, M:=Memory Exhaustion, and E:=Error.

        ARQ                Sesame_M           Sesame_DB          Virtuoso
Query   123334556789ABCCC  123334556789ABCCC  123334556789ABCCC  123334556789ABCCC
          abc ab      abc    abc ab      abc    abc ab      abc    abc ab      abc
10k     +++++++++++++++++  +++++++++++++++++  +++++++++++++++++  ++++++++E++++++++
50k     +++++++++++++++++  +++++++++++++++++  +++++++++++++++++  ++++++++E++++++++
250k    +++++T+++++++++++  ++++++T+T++++++++  ++++++T+TT+++++++  +++++TT+E++++++++
1M      +++++TT+TT+++++++  ++++++T+TT+++++++  ++++++T+TT+++++++  +++++TTTET+++++++
5M      +++++TT+TT+++++++  +++++TT+TT+++++++  +++++MT+TT+++++++  +++++TTTET+++++++
25M                        MMMMMMMTMMMMMTMMT  +++++TT+TT+++++++  (loading of document failed)



TABLE V
Number of query results on documents up to 25 million triples

Query          10k       50k      250k        1M        5M       25M
Q1               1         1         1         1         1         1
Q2             147       965      6197     32770    248738   1876999
Q3a            846      3647     15853     52676    192373    594890
Q3b              9        25       127       379      1317      4075
Q3c
Q4           23226    104746    542801   2586733  18362955       n/a
Q5a            155      1085      6904     35241    210662    696681
Q5b            155      1085      6904     35241    210662    696681
Q6             229      1769     12093     62795    417625   1945167
Q7                         2        62       292      1200      5099
Q8             184       264       332       400       493       493
Q9               4         4         4         4         4         4
Q10            166       307       452       572       656       656
Q11             10        10        10        10        10        10


TABLE VI
Arithmetic and geometric means of execution time (T_a/T_g) and arithmetic
mean of memory consumption (M_a) for the in-memory engines

        ARQ                             Sesame_M
        T_a [s]   T_g [s]   M_a [MB]    T_a [s]   T_g [s]   M_a [MB]
250k     491.87     56.35     318.25     442.47     28.64     272.27
1M       901.73    179.42     525.61     683.16    106.38     561.79
5M      1154.80    671.41    1347.55    1059.03    506.14    1437.38



TABLE VII
Arithmetic and geometric means of execution time (T_a/T_g) and arithmetic
mean of memory consumption (M_a) for the native engines

        Sesame_DB                       Virtuoso
        T_a [s]   T_g [s]   M_a [MB]    T_a [s]   T_g [s]   M_a [MB]
250k     639.86      6.79      73.92     546.31      1.31     377.60
1M       653.17     10.17     145.97     850.06      3.03     888.72
5M       860.33     22.91     196.33     870.16      8.96    1072.84



surprisingly, the memory consumption of the native engines
Virtuoso and Sesame_DB also increased with document size.

Arithmetic and Geometric Mean. For the in-memory engines we
observe that Sesame_M is superior to ARQ regarding both means
(see Table VI). For instance, the arithmetic (T_a) and geometric
(T_g) means for the engines on the 1M document over all queries¹⁸
are $T_a^{SesM} = 683.16$s, $T_g^{SesM} = 106.84$s,
$T_a^{ARQ} = 901.73$s, and $T_g^{ARQ} = 179.42$s.

For the native engines on 1M triples (cf. Table VII) we have
$T_a^{SesDB} = 653.17$s, $T_g^{SesDB} = 10.17$s,
$T_a^{Virt} = 850.06$s, and $T_g^{Virt} = 3.03$s. The arithmetic
mean of Sesame_DB is superior, mainly because it failed on only 4
(vs. 5) queries. The geometric mean moderates the impact of these
outliers. Virtuoso shows a better overall performance on the
queries it answers successfully, so its geometric mean is superior.
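
To illustrate why the two means can rank engines differently, the
following sketch computes both for two hypothetical engines, assigning
the 3600s penalty to failed queries. The timings and the helper name
penalized_means are assumptions made for illustration; they are not
measured values.

```python
import math

PENALTY_S = 3600.0  # penalty per failed query, as used for the reported means

def penalized_means(times):
    """times: execution times in seconds (> 0), with None for a failed query."""
    vals = [PENALTY_S if t is None else t for t in times]
    arithmetic = sum(vals) / len(vals)
    geometric = math.exp(sum(math.log(v) for v in vals) / len(vals))
    return arithmetic, geometric

# Hypothetical engine A: 13 queries answered in 20s each, 4 failures.
engine_a = [20.0] * 13 + [None] * 4
# Hypothetical engine B: 12 queries answered in 2s each, 5 failures.
engine_b = [2.0] * 12 + [None] * 5

a_arith, a_geo = penalized_means(engine_a)
b_arith, b_geo = penalized_means(engine_b)
print(f"A: T_a = {a_arith:.2f}s, T_g = {a_geo:.2f}s")
print(f"B: T_a = {b_arith:.2f}s, T_g = {b_geo:.2f}s")
```

In this made-up setting, engine A wins on the arithmetic mean because it
has one failure less, while engine B wins on the geometric mean because
it is faster on the queries it answers.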

¹⁸ We always penalize failure queries with 3600s.

In-memory Engines. Figure 5 (top) plots selected results for the
in-memory engines. We start with Q5a and Q5b. Although both
compute the same result, the engines perform much better for the
explicit join in Q5b. We suspect that the implicit join in Q5a is
not recognized, i.e. that both engines compute the cartesian
product and apply the filter afterwards.
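
The suspected difference between the two evaluation strategies can be
sketched as follows. This is a simplified illustration, not the actual
query plan of ARQ or Sesame: the tuples and function names are
hypothetical, and a hash join is only one of several ways an engine
might evaluate an explicit join.

```python
from collections import defaultdict
from itertools import product

def cross_product_then_filter(left, right):
    """Build every pair first, then keep the pairs whose names are equal."""
    return [(l, r) for l, r in product(left, right) if l[1] == r[1]]

def hash_join(left, right):
    """Hash one input on the join key and probe it with the other input."""
    index = defaultdict(list)
    for r in right:
        index[r[1]].append(r)
    return [(l, r) for l in left for r in index.get(l[1], [])]

# Hypothetical (document, author name) pairs.
articles = [("a1", "Alice"), ("a2", "Bob"), ("a3", "Carol")]
inprocs = [("i1", "Bob"), ("i2", "Alice"), ("i3", "Dave")]

naive = cross_product_then_filter(articles, inprocs)
joined = hash_join(articles, inprocs)
print(sorted(naive) == sorted(joined))
```

Both functions return the same pairs, but the first one inspects
len(left) * len(right) combinations, whereas the second touches each
input tuple once plus the matching pairs.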

Q6 and Q7 implement simple and double negation, respectively. Both
engines perform unsatisfactorily. At first glance, we might expect
Q7 (which involves double negation) to be more complicated to
evaluate, but we observe that Sesame_M scales even worse for Q6.
We identify two possible explanations. First, Q7 "negates"
documents with incoming citations, but, according to Section
III-D, only a small fraction of papers has incoming citations at
all. In contrast, Q6 negates arbitrary documents, i.e. a much
larger set. Another reasonable cause might be the non-equality
filter subexpression ?yr2 < ?yr inside the inner FILTER of Q6.
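
The encoding of negation used in Q6 (an OPTIONAL part followed by a
!bound filter) can be mimicked in a few lines. The sketch below is a
simplification under stated assumptions: publications are hypothetical
(doc, author, year) tuples, and the nested loop stands in for a naive
evaluation, not for the strategy of any particular engine.

```python
def first_publications(pubs):
    """Keep each (doc, author, year) row for which the same author
    has no publication in an earlier year."""
    result = []
    for doc, author, yr in pubs:
        # OPTIONAL part: publications of the same author with an earlier year.
        earlier = [d2 for d2, a2, yr2 in pubs if a2 == author and yr2 < yr]
        # FILTER (!bound(...)): keep the row only if the OPTIONAL part is empty.
        if not earlier:
            result.append((yr, author, doc))
    return result

# Hypothetical publication records.
pubs = [("d1", "Alice", 1990), ("d2", "Alice", 1995), ("d3", "Bob", 1992)]
print(first_publications(pubs))
```

The inner scan over all publications for every row makes the cost of
this naive evaluation quadratic in the number of publications, which is
consistent with the poor scaling observed for Q6.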

For the ASK query Q12a both engines scale linearly with document
size. However, from Table V and the fact that our data generator
is incremental and deterministic, we know that a "witness" is
already contained in the first 10k triples of the document. It
could therefore be located without reading the whole document, so
both evaluation strategies are suboptimal.
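
The following sketch illustrates how an ASK query could stop at the
first witness. It assumes a simplified single-pattern query instead of
the join in Q12a, and the triples and helper names are hypothetical.

```python
def person_matches(document, scanned):
    """Lazily yield matching triples, recording every triple that is read."""
    for triple in document:
        scanned.append(triple)
        _, predicate, obj = triple
        if predicate == "rdf:type" and obj == "foaf:Person":
            yield triple

def ask(solutions):
    """Answer True as soon as the first solution is produced."""
    return next(solutions, None) is not None

# Hypothetical document in generation order.
document = [
    ("ex:j1", "rdf:type", "bench:Journal"),
    ("ex:p1", "rdf:type", "foaf:Person"),
    ("ex:p2", "rdf:type", "foaf:Person"),
    ("ex:a1", "dc:creator", "ex:p1"),
]
scanned = []
found = ask(person_matches(document, scanned))
print(found, len(scanned), len(document))
```

Because the solutions are produced lazily, only the prefix of the
document up to the first witness is read.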

Fig. 5. Results for in-memory engines (top) and native engines (bottom) on S1=10k, S2=50k, S3=250k, S4=1M, S5=5M, and S6=25M triples

Native Engines. The leftmost plot at the bottom of Figure 5 shows
the loading times for the native engines Sesame_DB and Virtuoso.
Both engines scale well concerning usr and sys, essentially
linearly with document size. For Sesame_DB, however, tme grows
superlinearly (e.g., loading of the 25M document is about ten
times slower than loading of the 5M document). This might cause
problems for larger documents.
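
As a rough worked example, the loading-time observation above can be
turned into a scaling exponent, assuming that elapsed time follows a
power law in document size between the two measurements. The function
name is chosen for illustration; the factor of ten is the approximate
value stated in the text.

```python
import math

def scaling_exponent(size_ratio, time_ratio):
    """Exponent k such that time grows like size**k between two measurements."""
    return math.log(time_ratio) / math.log(size_ratio)

# 5M -> 25M triples is a factor of 5 in size and about 10 in elapsed time.
k = scaling_exponent(size_ratio=25 / 5, time_ratio=10)
print(round(k, 2))  # 1.43
```

An exponent of 1 would correspond to linear scaling, so a value of
roughly 1.43 quantifies the superlinear growth under this assumption.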

The running times for Q2 increase superlinearly for both engines
(in particular for larger documents). This reflects the
superlinear growth of inproceedings and the growing result size
(cf. Tables VIII and V). What is interesting here is the
significant difference between usr+sys and tme for Virtuoso,
which indicates disproportionate disk I/O. Since Sesame does not
exhibit this peculiar behavior, it might be an interesting
starting point for further optimizations in the Virtuoso engine.

Queries Q3a and Q3c have been designed to test the intelligent
choice of indices in the context of FILTER expressions with
varying selectivity. Virtuoso gets by with an economic consumption
of usr and sys time for both queries, which suggests that it
makes heavy use of indices. While this strategy pays off for Q3c,
the elapsed time for Q3a is unreasonably high and we observe that
Sesame_M scales better for this query.
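
To make the role of indices concrete, the sketch below contrasts a full
scan followed by a filter with a lookup in a predicate index, which is
possible because the FILTER in Q3 compares ?property with a constant.
The tiny store, its data, and its method names are hypothetical and do
not describe the internals of Virtuoso or Sesame.

```python
from collections import defaultdict

class TripleStore:
    """Toy in-memory store with an index from predicate to triples."""

    def __init__(self, triples):
        self.triples = list(triples)
        self.by_predicate = defaultdict(list)
        for s, p, o in self.triples:
            self.by_predicate[p].append((s, p, o))

    def scan_then_filter(self, predicate):
        # Reads every triple and applies the filter afterwards.
        return [t for t in self.triples if t[1] == predicate]

    def index_lookup(self, predicate):
        # Reads only the triples stored under the requested predicate.
        return list(self.by_predicate.get(predicate, []))

store = TripleStore([
    ("ex:a1", "rdf:type", "bench:Article"),
    ("ex:a1", "swrc:pages", "10"),
    ("ex:a2", "rdf:type", "bench:Article"),
    ("ex:a2", "swrc:month", "4"),
])
print(store.index_lookup("swrc:pages"))
print(store.index_lookup("swrc:isbn"))
```

With such an index, a highly selective filter (no match at all in the
second call) is answered without touching unrelated triples.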

Q10 extracts subjects and predicates that are associated with
Paul Erdős. First recall that, for each year up to 1996, Paul
Erdős has exactly 10 publications and occurs twice as editor
(cf. Section IV). Both engines answer this query in about constant
time, which is possible due to the upper bound on the result size
(cf. Table V). Regarding usr+sys, Virtuoso is even more
efficient: these times are negligible in all cases. Hence, this
query constitutes an example of desired engine behavior.

VII. Conclusion 

We have presented the SP²Bench performance benchmark for SPARQL,
which constitutes the first methodical approach for testing the
performance of SPARQL engines w.r.t. different operator
constellations, RDF access paths, typical RDF constructs, and a
variety of possible optimization approaches.

Our data generator relies on a deep study of DBLP. Although it is
not possible to mirror all correlations found in the original DBLP
data (e.g., we simplified when assuming independence between
attributes in Section III-A), many aspects



TABLE VIII
Characteristics of generated documents

#Triples             10k      50k     250k       1M       5M      25M
file size [MB]       1.0      5.1       26      106      533     2694
data up to          1955     1967     1979     1989     2001     2015
#Tot.Auth.          1.5k     6.8k    34.5k   151.0k   898.0k     5.4M
#Dist.Auth.         0.9k     4.1k    20.0k    82.1k   429.6k     2.1M
#Journals             25      104      439     1.4k     4.6k    11.7k
#Articles            916     4.0k    17.1k    56.9k   207.8k   642.8k
#Proc.                 6       37      213      903     4.7k    24.4k
#Inproc.             169     1.4k     9.2k    43.5k   255.2k     1.5M
#Incoll.              18       56      173      442     1.4k     4.5k
#Books                                  39      356      973     1.7k
#PhD Th.                                        101      237      365
#Mast.Th.                                        50       95      169
#WWWs                                            35       92      168



are modeled in faithful detail and the queries are designed in 
such a way that they build on exactly those aspects, which 
makes them realistic, understandable, and predictable. 

Even without knowledge about the internals of engines, we 
identified deficiencies and reasoned about suspected causes. 
We expect the benefit of our benchmark to be even higher for 
developers that are familiar with the engine internals. 

To give another proof of concept, in [33] we have successfully
used SP²Bench to identify previously unknown limitations of RDF
storage schemes: Among others, we identified scenarios where the
advanced vertical storage scheme from [12] was slower than a
simple triple store approach.

With the understandable DBLP scenario we clear the way for coming
language modifications. For instance, SPARQL update and
aggregation support are currently discussed as possible
extensions.¹⁹ Updates, for instance, could be realized by minor
extensions to our data generator. Concerning aggregations, the
detailed knowledge of the document class counts and distributions
(cf. Section III) facilitates the design of challenging aggregate
queries with fixed characteristics.

¹⁹ See http://esw.w3.org/topic/SPARQL/Extensions.
Appendix



Q1

  SELECT ?yr
  WHERE {
    ?journal rdf:type bench:Journal.
    ?journal dc:title "Journal 1 (1940)"^^xsd:string.
    ?journal dcterms:issued ?yr }



Q2

  SELECT ?inproc ?author ?booktitle ?title
         ?proc ?ee ?page ?url ?yr ?abstract
  WHERE {
    ?inproc rdf:type bench:Inproceedings.
    ?inproc dc:creator ?author.
    ?inproc bench:booktitle ?booktitle.
    ?inproc dc:title ?title.
    ?inproc dcterms:partOf ?proc.
    ?inproc rdfs:seeAlso ?ee.
    ?inproc swrc:pages ?page.
    ?inproc foaf:homepage ?url.
    ?inproc dcterms:issued ?yr
    OPTIONAL { ?inproc bench:abstract ?abstract }
  } ORDER BY ?yr



Q3

  (a) SELECT ?article
      WHERE { ?article rdf:type bench:Article.
              ?article ?property ?value
              FILTER (?property=swrc:pages) }

  (b) Q3a, but "swrc:month" instead of "swrc:pages"

  (c) Q3a, but "swrc:isbn" instead of "swrc:pages"



Q4

  SELECT DISTINCT ?name1 ?name2
  WHERE { ?article1 rdf:type bench:Article.
          ?article2 rdf:type bench:Article.
          ?article1 dc:creator ?author1.
          ?author1 foaf:name ?name1.
          ?article2 dc:creator ?author2.
          ?author2 foaf:name ?name2.
          ?article1 swrc:journal ?journal.
          ?article2 swrc:journal ?journal
          FILTER (?name1<?name2) }



Q5

  (a) SELECT DISTINCT ?person ?name
      WHERE { ?article rdf:type bench:Article.
              ?article dc:creator ?person.
              ?inproc rdf:type bench:Inproceedings.
              ?inproc dc:creator ?person2.
              ?person foaf:name ?name.
              ?person2 foaf:name ?name2
              FILTER (?name=?name2) }

  (b) SELECT DISTINCT ?person ?name
      WHERE { ?article rdf:type bench:Article.
              ?article dc:creator ?person.
              ?inproc rdf:type bench:Inproceedings.
              ?inproc dc:creator ?person.
              ?person foaf:name ?name }



Q6

  SELECT ?yr ?name ?doc
  WHERE {
    ?class rdfs:subClassOf foaf:Document.
    ?doc rdf:type ?class.
    ?doc dcterms:issued ?yr.
    ?doc dc:creator ?author.
    ?author foaf:name ?name
    OPTIONAL {
      ?class2 rdfs:subClassOf foaf:Document.
      ?doc2 rdf:type ?class2.
      ?doc2 dcterms:issued ?yr2.
      ?doc2 dc:creator ?author2
      FILTER (?author=?author2 && ?yr2<?yr) }
    FILTER (!bound(?author2)) }



Q7

  SELECT DISTINCT ?title
  WHERE {
    ?class rdfs:subClassOf foaf:Document.
    ?doc rdf:type ?class.
    ?doc dc:title ?title.
    ?bag2 ?member2 ?doc.
    ?doc2 dcterms:references ?bag2
    OPTIONAL {
      ?class3 rdfs:subClassOf foaf:Document.
      ?doc3 rdf:type ?class3.
      ?doc3 dcterms:references ?bag3.
      ?bag3 ?member3 ?doc
      OPTIONAL {
        ?class4 rdfs:subClassOf foaf:Document.
        ?doc4 rdf:type ?class4.
        ?doc4 dcterms:references ?bag4.
        ?bag4 ?member4 ?doc3 }
      FILTER (!bound(?doc4)) }
    FILTER (!bound(?doc3)) }



Q8

  SELECT DISTINCT ?name
  WHERE {
    ?erdoes rdf:type foaf:Person.
    ?erdoes foaf:name "Paul Erdoes"^^xsd:string.
    { ?doc dc:creator ?erdoes.
      ?doc dc:creator ?author.
      ?doc2 dc:creator ?author.
      ?doc2 dc:creator ?author2.
      ?author2 foaf:name ?name
      FILTER (?author!=?erdoes &&
              ?doc2!=?doc &&
              ?author2!=?erdoes &&
              ?author2!=?author)
    } UNION {
      ?doc dc:creator ?erdoes.
      ?doc dc:creator ?author.
      ?author foaf:name ?name
      FILTER (?author!=?erdoes) } }



Q9

  SELECT DISTINCT ?predicate
  WHERE {
    { ?person rdf:type foaf:Person.
      ?subject ?predicate ?person } UNION
    { ?person rdf:type foaf:Person.
      ?person ?predicate ?object } }



Q10

  SELECT ?subj ?pred
  WHERE { ?subj ?pred person:Paul_Erdoes }



Q11

  SELECT ?ee
  WHERE { ?publication rdfs:seeAlso ?ee }
  ORDER BY ?ee LIMIT 10 OFFSET 50



Q12

  (a) Q5a as ASK query

  (b) Q8 as ASK query

  (c) ASK {person:John_Q_Public rdf:type foaf:Person}
6. Query evaluation results on S1=10k.